Text Mining Kickstarter Projects

Overview

Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowd funding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”.

Kickstarter has reportedly received almost $6 billion in pledges from 19.4 million backers to fund 200,000 creative projects, such as films, music, stage shows, comics, journalism, video games, technology and food-related projects.

For this assignment, I am asking you to analyze the descriptions of kickstarter projects to identify commonalities of successful (and unsuccessful projects) using the text mining techniques we covered in the past two lectures.

Data

The dataset for this assignment is taken from webroboto.io ‘s repository. They developed a scrapper robot that crawls all Kickstarter projects monthly since 2009. We will just take data from the most recent crawl on 2021-03-18.

To simplify your task, I have downloaded the files and partially cleaned the scraped data. In particular, I converted several JSON columns, corrected some obvious data issues, and removed some variables that are not of interest (or missing frequently), and removed some duplicated project entries. I have also subsetted the data to only contain projects originating in the United States (to have only English language and USD denominated projects). Some data issues surely remain, so please adjust as you find it necessary to complete the analysis.

The data is contained in the file kickstarter_projects_2021_03.csv and contains about 125k projects and about 20 variables.

Tasks for the Assignment

1. Identifying Successful Projects

a) Success by Category

There are several ways to identify success of a project:
- State (state): Whether a campaign was successful or not.
- Pledged Amount (pledged)
- Achievement Ratio: The variable achievement_ratio is calculating the percentage of the original monetary goal reached by the actual amount pledged (that is pledged\goal *100).
- Number of backers (backers_count)
- How quickly the goal was reached (difference between launched_at and state_changed_at) for those campaigns that were successful.

Use one or more of these measures to visually summarize which categories were most successful in attracting funding on kickstarter. Briefly summarize your findings.

kickstarter <- read.csv("kickstarter_projects_2021-03.csv", 
                        stringsAsFactors = FALSE)
colnames(kickstarter)
##  [1] "backers_count"            "blurb"                   
##  [3] "converted_pledged_amount" "country"                 
##  [5] "country_displayable_name" "created_at"              
##  [7] "currency"                 "deadline"                
##  [9] "goal"                     "id"                      
## [11] "is_starrable"             "launched_at"             
## [13] "name"                     "pledged"                 
## [15] "slug"                     "source_url"              
## [17] "spotlight"                "staff_pick"              
## [19] "state"                    "state_changed_at"        
## [21] "location_town"            "location_state"          
## [23] "top_category"             "sub_category"
unique(kickstarter$top_category)
##  [1] "crafts"       "dance"        "comics"       "design"       "technology"  
##  [6] "publishing"   "photography"  "music"        "art"          "film & video"
## [11] "food"         "games"        "fashion"      "journalism"

Here, I am using the Achievement Ratio as an example, which should be calculated use (pleaged \ goal * 100).

kickstarter %>%
  mutate(achievement_ratio = pledged / goal, 
         success = ifelse(state %in% c("failed", "canceled"), 
                          0,
                          1)) %>%
  group_by(top_category) %>%
  summarise(count = n(), 
            average_achievement_ratio = sum(achievement_ratio)/count,
            success_rate = sum(success) / count) %>%
  ungroup() %>%
  arrange(desc(average_achievement_ratio)) %>%
  ggplot(aes(x = fct_reorder(str_to_title(top_category), 
                             average_achievement_ratio), 
             y = average_achievement_ratio, 
             fill = success_rate))+
  geom_bar(stat = "identity")+
  coord_flip()+
  scale_fill_viridis(limits = c(0, 1), 
                     breaks = seq(0, 1, by = 0.2))+
  ggtitle("Success of Kickstarter Projects")+
  geom_hline(yintercept = 1.0, 
             linetype = "dashed")+
  labs(subtitle = "Average Achievement Ratio by Category", 
       y = "Average Achievement Ratio ($ pledged / $ goal)",
       x = "", 
       caption = "Source: Web Robots Kickstarter Dataset 2021")+
  guides(fill = guide_colorbar(title = "Success Rate", 
                               barwidth = 11, barheight = 1))+
  theme_fivethirtyeight()+
  theme(plot.title = element_text(size = 11, hjust = 0, face = "bold"),
        plot.subtitle = element_text(size = 10, hjust = 0), 
        axis.title.x = element_text(size = 10, hjust = 1, face = "bold"), 
        legend.title = element_text(size = 10), 
        legend.position = c(0.15, -0.15)) # legend position adjustment use vector

Findings

  • Based on the plot, looks like projects fall into categories such as music, comics experience both the high achievement ratio and success rate, while projects belong to Food, Dance, Photography, and Journalism has low achievement ratio.

  • Dance projects have low achievement ratio on average, but do have high success rate. Looks like on average, dance projects do meet their pledged goals, but the money they received does not always exceed the pledged amount.

BONUS ONLY: b) Success by Location

Now, use the location information to calculate the total number of successful projects by state (if you are ambitious, normalize by population). Also, identify the Top 50 “innovative” cities in the U.S. (by whatever measure you find plausible). Provide a leaflet map showing the most innovative states and cities in the U.S. on a single map based on these information.

success_projects_by_state <- kickstarter %>%
  mutate(success = ifelse(state %in% c("failed", "canceled"),
                          0,
                          1)) %>%
  group_by(location_state) %>%
  summarise(success_total = sum(success)) %>%
  ungroup() %>%
  arrange(desc(success_total))
head(success_projects_by_state)
## # A tibble: 6 x 2
##   location_state success_total
##   <chr>                  <dbl>
## 1 CA                     14751
## 2 NY                      9678
## 3 TX                      3939
## 4 WA                      2791
## 5 IL                      2784
## 6 FL                      2656
states_data <- data.frame(
  state_abb = state.abb,
  state_name = state.name
) %>%
  bind_rows(tibble(
    state_abb = "DC",
    state_name = "District of Columbia"
  ))
# Try to get population data
us_population_url <- getURL("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population")
# Third table of Wikipedia page
state_population <- readHTMLTable(us_population_url, which = 3, 
                                  header = FALSE, skip.rows = 1) %>%
  dplyr::select("state" = V1,
                "population" = V3) %>%
  mutate(population = as.numeric(gsub(pattern = "\\,", 
                                      replacement = "", x = population)))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
# Join datasets
combined_dataset <- success_projects_by_state %>%
  inner_join(states_data, by = c("location_state" = "state_abb")) %>%
  inner_join(state_population, by = c("state_name" = "state")) %>%
  mutate(population = population / 1000000, 
         success_by_population = success_total / population)
head(combined_dataset)
## # A tibble: 6 x 5
##   location_state success_total state_name population success_by_population
##   <chr>                  <dbl> <chr>           <dbl>                 <dbl>
## 1 CA                     14751 California      39.5                   373.
## 2 NY                      9678 New York        19.5                   497.
## 3 TX                      3939 Texas           29.0                   136.
## 4 WA                      2791 Washington       7.61                  367.
## 5 IL                      2784 Illinois        12.7                   220.
## 6 FL                      2656 Florida         21.5                   124.
# Retrieve US geographical dataset
states_sp <- states(cb = TRUE)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
# Subset the geographical dataset to include only the states that we have information
states_sp <- subset(states_sp, STUSPS %in% combined_dataset$location_state) %>%
  left_join(combined_dataset, by = c("STUSPS" = "location_state"))
range(states_sp$success_by_population)
## [1]   48.04867 1055.61609
bins <- seq(0, 1100, by = 220)
pal <- colorBin(palette = "viridis", 
                domain = states_sp$success_by_population, 
                bins = bins)
# Create labels for leaflet maps
states_sp$label <- paste("State:", states_sp$state_name, "<br>", 
                         "Population (in millions):", states_sp$population, "<br>", 
                         "Success by Population:", states_sp$success_by_population)
leaflet(data = states_sp, 
        options = leafletOptions(minZoom = 3, maxZoom = 6)) %>%
  addProviderTiles(providers$Esri.WorldGrayCanvas) %>%
  setView(lng = -96, lat = 37.8, zoom = 4) %>%
  addPolygons(weight = 1, 
              smoothFactor = 0.5,
              color = "#FFFFFF",
              fillColor = ~pal(states_sp$success_by_population), 
              fillOpacity = 0.8, 
              label = ~lapply(label, HTML), 
              labelOptions = list(textsize = "14px"),
              highlight = highlightOptions(
                bringToFront = TRUE,
                color = "white",
                weight = 5
              )) %>%
  addLegend(position = "bottomright",
            pal = pal,
            values = c(0, 1100),
            title = "Success by Population (in millions)")
## Warning: sf layer has inconsistent datum (+proj=longlat +datum=NAD83 +no_defs).
## Need '+proj=longlat +datum=WGS84'

I am not going to show the most innovative cities in the LeafLet maps because this assignment is focusing on text analysis.

2. Writing your success story

Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.

a) Cleaning the Text and Word Cloud

To reduce the time for analysis, select the 1000 most successful projects and a sample of 1000 unsuccessful projects. Use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space etc. Note, that many projects use their own unique brand names in upper cases, so try to remove these fully capitalized words as well (since we are aiming to identify common words across descriptions). Create a document-term-matrix.

# Select successful projects and unsuccessful projects
kickstarter <- kickstarter %>%
  mutate(achievement_ratio = pledged / goal, 
         success = ifelse(state %in% c("failed", "canceled"), 
                          0, 
                          1))
successful_projects <- kickstarter %>%
  arrange(desc(achievement_ratio)) %>%
  top_n(1000, wt = achievement_ratio)

unsuccessful_projects <- kickstarter %>%
  filter(achievement_ratio < 1 & success == 0) %>%
  arrange(achievement_ratio) %>%
  top_n(1000, wt = achievement_ratio)
# Extract Some Brand Names
dict <- unique(unlist(str_extract_all(string = successful_projects$blurb, 
            pattern = SPC %R% one_or_more(char_class("A-Z")) %R% SPC)))
dict <- str_replace_all(dict, pattern = SPC, 
                        replacement = "")
dict2 <- unique(unlist(str_extract_all(string = unsuccessful_projects$blurb, 
                                       pattern = SPC %R% one_or_more(char_class("A-Z")) %R% SPC)))
dict2 <- str_replace_all(dict2, pattern = SPC, 
                         replacement = "")
combined_dict <- sort(unique(c(dict, dict2)))
# Remove single character word
combined_dict <- combined_dict[nchar(combined_dict) > 1]
# Create a projects categories dictionary
categories_dic <- unique(str_to_lower(kickstarter$top_category))
categories_dic <- unlist(str_split(categories_dic, pattern = SPC %R% "&" %R% SPC))
# Add singular form as well, while if you are ambitious, you should consider do word stemming
categories_dic <- c(categories_dic, "craft", "comic", "game")
categories_dic
##  [1] "crafts"      "dance"       "comics"      "design"      "technology" 
##  [6] "publishing"  "photography" "music"       "art"         "film"       
## [11] "video"       "food"        "games"       "fashion"     "journalism" 
## [16] "craft"       "comic"       "game"
clean_corpus <- function(corpus){
  # Replace Abbreviation
  corpus <- tm::tm_map(corpus, content_transformer(replace_abbreviation))
  # Replace Contraction
  corpus <- tm::tm_map(corpus, content_transformer(replace_contraction))
  # Replace Mixed Ordinal Numbers with Text Representation
  corpus <- tm::tm_map(corpus, content_transformer(replace_ordinal))
  # Replace Symbols
  corpus <- tm::tm_map(corpus, content_transformer(replace_symbol))
  # Remove Numbers
  corpus <- tm::tm_map(corpus, removeNumbers)
  # Remove Upper Case Words
  corpus <- tm::tm_map(corpus, removeWords, words = combined_dict)
  # Replace to all lower case
  corpus <- tm::tm_map(corpus, content_transformer(tolower))
  # Remove Stop Words and categories
  corpus <- tm::tm_map(corpus, removeWords, words = c(stop_words$word, # use tidytext stopwords
                                                      categories_dic))
  # Remove Punctuation
  corpus <- tm::tm_map(corpus, removePunctuation)
  # Remove white space
  corpus <- tm::tm_map(corpus, stripWhitespace)
  return(corpus)
}

Provide a word cloud of the most frequent or important words (your choice which frequency measure you choose) among the most successful projects.

# Remove emojis
successful_projects$blurb <- iconv(successful_projects$blurb, from = "UTF-8", 
                                   to = "ASCII//TRANSLIT", sub = "byte")
unsuccessful_projects$blurb <- iconv(unsuccessful_projects$blurb, from = "UTF-8", 
                                   to = "ASCII//TRANSLIT", sub = "byte")
success_blurbs <- paste(successful_projects$blurb, collapse = "")
# Change to Corpus
success_source <- VectorSource(success_blurbs)
success_corpus <- VCorpus(success_source)
# Change to Term Document Matrix
success_clean <- TermDocumentMatrix(clean_corpus(success_corpus))
# Change to matrix
success_clean_m <- as.matrix(success_clean)
# Change to dataframe
success_clean_df <- success_clean_m %>%
  as_tibble(rownames = "word") %>%
  rename("frequency" = `1`)
success_top100 <- success_clean_df %>%
  arrange(desc(frequency)) %>%
  top_n(100, wt = frequency)
# You should have viridisLite package installed
color_pal <- viridis(n = 5)
set.seed(12345)
wordcloud(words = success_top100$word, 
          freq = success_top100$frequency,
          max.words = 100,
          colors = color_pal, 
          random.order = FALSE)
title(main = "Top 100 Words Cloud for Successful Projects")

There are lots of good words, such as “inspired”, “smart”, “favorite”, “love”, “magic”, “fantasy” etc. But may be sentiment analysis could help us get better ideas.

b) Success in words

Provide a pyramid plot to show how the words between successful and unsuccessful projects differ in frequency. A selection of 10 - 20 top words is sufficient here.

unsuccess_blurbs <- paste(unsuccessful_projects$blurb, collapse = "")
all_projects <- c(success_blurbs, unsuccess_blurbs)
# Combine the source
all_projects_source <- tm::VectorSource(all_projects)
all_projects_corpus <- tm::VCorpus(all_projects_source)
# Clean the corpus and change to TDM
all_projects_tdm <- TermDocumentMatrix(clean_corpus(all_projects_corpus))
all_project_tdm_m <- as.matrix(all_projects_tdm)
colnames(all_project_tdm_m) <- c("Success", "Unsuccess")
all_project_tdm_df <- all_project_tdm_m %>%
  as_tibble(rownames = "word")
# Pyramid Plot
mydata <- all_project_tdm_df %>%
  filter(Success > 1 & Unsuccess > 1) %>%
  mutate(diff = Success - Unsuccess) %>%
  arrange(desc(diff)) %>%
  top_n(20, wt = diff) %>%
  dplyr::select(word, Success, Unsuccess) %>%
  mutate(Unsuccess = 0-Unsuccess) %>%
  pivot_longer(cols = Success:Unsuccess, 
               names_to = c("project"), 
               values_to = "count")

ggplot(mydata, aes(x = fct_reorder(word, count), y = count, fill = project))+
  geom_bar(data = subset(mydata, project == "Success"), stat = "identity")+
  geom_bar(data = subset(mydata, project == "Unsuccess"), stat = "identity")+
  scale_fill_manual(values = c("red", "blue"))+
  scale_y_continuous(limits = c(-50, 100))+
  labs(title = "Words in Common, Successful v.s Unsuccessful Projects",
       x = "Terms",
       y = "",
       caption = "Source: Web Robats Kickstarter Dataset 2021")+
  coord_flip()+
  guides(fill = guide_legend(title = "Project"))+
  theme_bw()+
  theme(plot.title = element_text(size = 10, face = "bold", hjust = 0), 
        axis.text.y = element_text(face = "bold"), 
        axis.text.x = element_blank())

c) Simplicity as a virtue

These blurbs are short in length (max. 150 characters) but let’s see whether brevity and simplicity still matters. Calculate a readability measure (Flesh Reading Ease, Flesh Kincaid or any other comparable measure) for the texts. Visualize the relationship between the readability measure and one of the measures of success. Briefly comment on your finding.

# Now we need to use a Document Term Matrix
document <- successful_projects %>%
  dplyr::select(id, blurb, achievement_ratio, success) %>%
  bind_rows(
    unsuccessful_projects %>% dplyr::select(id, blurb, achievement_ratio, success)) %>%
  rename("doc_id" = id, 
         "text" = blurb)
# Use DataFrame Source
document_source <- tm::DataframeSource(document)
# Change to corpus
document_corpus <- tm::VCorpus(document_source)
# Clean corpus
document_corpus_clean <- clean_corpus(document_corpus)
# Convert to Document Term Matrix with cleaned corpus
document_dtm_clean <- DocumentTermMatrix(document_corpus_clean)
# Check the first document content
content(document_corpus_clean[[1]])
## [1] " installment popular covers friends albu miss time charm"
# Access meta data
NLP::meta(document_corpus_clean[1])
##   achievement_ratio success
## 1           68764.1       1
# Calculate the Readability Measure
corp_quanteda <- quanteda::corpus(document_corpus_clean)
# Check docvars
head(docvars(corp_quanteda)$id)
## [1] "1947298033" "1932294007" "246326560"  "521903377"  "550443638" 
## [6] "1077219132"
quanteda_df <- quanteda::textstat_readability(corp_quanteda, 
                                              measure = c("Flesch.Kincaid")) %>%
  mutate(document = docvars(corp_quanteda)$id)
quanteda_df %>%
  inner_join(document %>% mutate(doc_id = as.character(doc_id)), 
             by = c("document" = "doc_id")) %>%
  ggplot(aes(x = achievement_ratio, y = Flesch.Kincaid))+
  geom_point(aes(color = as.factor(success)))+
  scale_x_log10(labels = scales::comma, 
                n.breaks = 6)+
  geom_smooth()+
  scale_color_discrete("Project Status", labels = c("Unsuccess", "Success"))+
  theme_fivethirtyeight()+
  labs(title = "Achievement Ratio v.s Sentence Complexity",
       x = "Achievment Ratio (log-transformed)",
       y = "Flesh Kincaid Measurement")+
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0),
        axis.title.x = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title.y = element_text(size = 10, vjust = 1, face = "bold"),
        legend.position = "top")

No clear relationship.

3. Sentiment

Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.

a) Stay positive

Calculate the tone of each text based on the positive and negative words that are being used. You can rely on the Hu & Liu dictionary provided in lecture or use the Bing dictionary contained in the tidytext package (tidytext::sentiments). Visualize the relationship between tone of the document and success. Briefly comment.

# Now We can use a tidytext approach
df2 <- as_data_frame(document_corpus_clean) %>%
  drop_na()

df2 %>%
  unnest_tokens(word, text) %>%
  inner_join(sentiments, by = "word") %>%
  mutate(sentiment = ifelse(sentiment == "negative", 
                            -1, 
                            1)) %>%
  group_by(doc_id, achievement_ratio, success) %>%
  summarise(sentiment_score = sum(sentiment)) %>%
  ungroup() %>% 
  ggplot(aes(x = as.factor(success), y = sentiment_score, 
             color = as.factor(success)))+
  stat_boxplot(geom = "errorbar", width = 0.15)+
  scale_x_discrete(labels = c("Unsuccess", "Success"))+
  geom_boxplot()+
  theme_fivethirtyeight()+
  # ggplot(aes(x = sentiment_score, y = achievement_ratio, 
  #            color = as.factor(success)))+
  # geom_point()+
  # geom_smooth(se = FALSE)+
  # scale_y_log10(labels = scales::comma,
  #               n.breaks = 6)+
  # scale_color_discrete(labels = c("Unsuccess", "Success"))+
  # theme_fivethirtyeight()+
  labs(title = "Sentiment Scores by Project Status",
       y = "Sentiment Score",
       x = "")+
  guides(color = FALSE)+
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 10, face = "bold", vjust = 1),
        legend.position = "top")

b) Positive vs negative

Segregate all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Now, collapse the positive and negative texts into two larger documents. Create a document-term-matrix based on this collapsed set of two documents. Generate a comparison cloud showing the most-frequent positive and negative words.

df3 <- df2 %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(doc_id) %>%
  summarise(sentiment_score = sum(value)) %>%
  ungroup() %>%
  mutate(document_category = ifelse(sentiment_score > 0, 
                           "positive", 
                           "negative"))
positive_documents_id <- df3 %>% filter(document_category == "positive") %>% dplyr::select(doc_id) %>% as_vector()

positive_documents <- document %>% filter(doc_id %in% positive_documents_id) %>% dplyr::select(text)

negative_documents_id <- df3 %>% filter(document_category == "negative") %>% dplyr::select(doc_id) %>% as_vector()

negative_documents <- document %>% filter(doc_id %in% negative_documents_id) %>% dplyr::select(text)
combined_sentiment_vector <- c(paste(positive_documents$text, collapse = ""),
                               paste(negative_documents$text, collapse = ""))
combined_source <- tm::VectorSource(combined_sentiment_vector)
combined_corpus <- tm::VCorpus(combined_source)
combined_corpus_clean <- clean_corpus(combined_corpus)
combined_tdm <- TermDocumentMatrix(combined_corpus_clean)
combined_tdm_m <- as.matrix(combined_tdm)
colnames(combined_tdm_m) <- c("Positive", "Negative")
set.seed(12345)
comparison.cloud(combined_tdm_m, 
                 colors = c("#EB8A44", "#75B1A9"),
                 max.words = 150, 
                 title.size = 1.5, 
                 scale = c(4, 0.10))

c) Get in their mind

Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Again, visualize the relationship between the use of words from these categories and success. What is your finding?

df4 <- df2 %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  group_by(success) %>%
  count(sentiment) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  mutate(percent = ifelse(success == 0,
                          -percent,
                          percent))

ggplot(data = df4, aes(x = fct_reorder(sentiment, percent),
             y = percent, fill = as.factor(success)))+
  geom_bar(data = df4 %>% filter(success == 1), stat = "identity")+
  geom_bar(data = df4 %>% filter(success == 0), stat = "identity")+
  scale_fill_manual("Project Status",
                    values = c("red", "blue"),
                    labels = c("Unsuccess", "Success"))+
  scale_y_continuous(labels = scales::percent,
                     limits = c(-0.30, 0.30),
                     breaks = seq(-0.30, 0.30, by = 0.1))+
  coord_flip()+
  theme_bw()+
  labs(title = "NRC Word-Emotion Lexicon by Project Status",
       y = "Percent",
       x = "NRC Emotions Categories")+
  theme(plot.title = element_text(size = 11, face = "bold", hjust = 0.5),
        axis.title.y = element_text(size = 10, face = "bold", vjust = 1))

Very close results!

Submission

Please follow the instructions to submit your homework. The homework is due on Wednesday, March 31.

Please stay honest!

If you do come across something online that provides part of the analysis / code etc., please no wholesale copying of other ideas. We are trying to evaluate your abilities to visualized data not the ability to do internet searches. Also, this is an individually assigned exercise – please keep your solution to yourself.